Author

Education Atlas

Published

December 27, 2024

1. General Information About Data

The Turkish Statistical Institute (TÜİK) is motivated to ensure the production, publication and distribution of necessary statistics while compiling Turkey’s education data and information. TÜİK collects data from individuals, households, workplaces through surveys and censuses. The education data it collects helps to compare education information and gives an idea of ​​what education needs a country has for its development.

Data Source: TÜİK

2. Why We Choose Data and Our Purposes

As part of the EMU430 course, we decided to examine the National Education Statistics dataset published by the Turkish Statistical Institute (TÜİK). This dataset provides comprehensive information about the state of education in Turkey. It is particularly suitable for analyzing issues such as reading rates, gender differences in education, and regional education levels.

Education plays a key role in the development of a society. Therefore, by choosing this dataset for our project, we aim to determine the strengths of Turkey’s education system and areas that need improvement. We plan to make important inferences about education with this data.

The main goal of our project is to conduct an in-depth analysis of the state and trends of education in Turkey. In this context, we will seek answers to the following questions:

Are there differences in education levels between genders? Has education equality been achieved regionally? How has the level of education in Turkey progressed over the years?

By answering these questions, we aim to understand the problems in the education system and develop suggestions that can contribute to education policies in Turkey.

3. Data Preprocessing and EDA Analysis


title: “Data Preprocessing Report” author: “Education Atlas” date: today format: html: toc: true toc-depth: 2 code-fold: true code-summary: “Code Details” execute: echo: true warning: false error: true


📊 Introduction

This project aims to analyze educational data from 2008 to 2023. The following preprocessing steps prepare the dataset for analysis:

  1. Data loading and removal of empty columns
  2. Partitioning data by years
  3. Column header adjustment
  4. Creating multi-level column headers

🛠️ Step 0: Manual Data Preprocessing in Excel

Before starting data analysis, the dataset was manually cleaned in Excel to ensure consistency and accuracy:

  1. Empty Columns Removed: Completely empty columns were deleted.
  2. City Name Standardization: Inconsistent city names (e.g., Afyonkarahisar → Afyon, İçel → Mersin) were corrected.
  3. Unnecessary Rows Cleared: Rows containing summary totals (e.g., labeled as Türkiye) and metadata were removed.
  4. Header Adjustments: Column headers were standardized, and missing values were filled.

This cleaned dataset (yıllara_göre_egitim.xlsx) was then used for further processing in Python. 🚀

⚙️ Step 1: Data Loading and Cleaning

The dataset is loaded using Python’s pandas library, and completely empty columns are removed.

import pandas as pd

# Load the dataset and remove completely empty columns
df = pd.read_excel('data/yıllara_göre_egitim.xlsx')
df = df.dropna(axis=1, how='all')

Description:
- The dataset is loaded using the pandas library.
- Completely empty columns are removed from the dataset.


📅 Step 2: Partitioning Data by Years

The dataset is partitioned into separate data frames for each year.

# Define years and index ranges
years = list(range(2008, 2024))  # From 2008 to 2023
index_ranges = [(1 + i * 82, 1 + (i + 1) * 82) for i in range(len(years))]

# Create sub-data frames for each year
yearly_data = {}
for year, (start, end) in zip(years, index_ranges):
    yearly_data[year] = df.iloc[start:end]

Description:
- The dataset is partitioned into yearly data frames.
- Index ranges are calculated for each year.


🔍 Step 3: Viewing Sample Years

Sample records from the first three years are examined.

# Inspect sample data frames from the first few years
sample_years = {year: data.head(1) for year, data in list(yearly_data.items())[:3]}

# Display sample data for verification
for year, sample in sample_years.items():
    print(f"Sample data for year {year}:")
    print(sample)

Description:
- Data frames for the first three years are inspected.
- Header structure is validated.
- Sample data from the first row of each year’s data frame is displayed.


📝 Step 4: Adjusting Column Headers

The first two rows are used as column headers, and missing values are filled.

import pandas as pd
df = pd.read_excel('data/yıllara_göre_egitim.xlsx')
df = df.dropna(axis=1, how='all')
years = list(range(2008, 2024))  # From 2008 to 2023
index_ranges = [(1 + i * 82, 1 + (i + 1) * 82) for i in range(len(years))]

# Create sub-data frames for each year
yearly_data = {}
for year, (start, end) in zip(years, index_ranges):
    yearly_data[year] = df.iloc[start:end]
# Use the first two rows as headers and fill missing values
header1 = df.iloc[0].bfill()
header2 = df.iloc[1].bfill()

# Create multi-level (MultiIndex) headers
df.columns = pd.MultiIndex.from_tuples(zip(header1, header2))

# Remove the first two rows and reset the dataset
df = df.iloc[2:].reset_index(drop=True)

Description:
- The first two rows are used as column headers.
- Missing values are filled (bfill).
- Multi-level headers are created.


📑 Step 5: Structuring Column Headers

Column headers are further structured.

Code Details
import pandas as pd
df = pd.read_excel('data/yıllara_göre_egitim.xlsx')
df = df.dropna(axis=1, how='all')
years = list(range(2008, 2024))  # From 2008 to 2023
index_ranges = [(1 + i * 82, 1 + (i + 1) * 82) for i in range(len(years))]

# Create sub-data frames for each year
yearly_data = {}
for year, (start, end) in zip(years, index_ranges):
    yearly_data[year] = df.iloc[start:end]
# Use the first two rows as headers and fill missing values
header1 = df.iloc[0].bfill()
header2 = df.iloc[1].bfill()

# Create multi-level (MultiIndex) headers
df.columns = pd.MultiIndex.from_tuples(zip(header1, header2))

# Remove the first two rows and reset the dataset
df = df.iloc[2:].reset_index(drop=True)
# Create single-level headers for the first three columns
first_three_headers = ['Year', 'Province Code', 'Province Name']

# Define main and sub-headers
main_headers = [
    'Genel toplam', 'Okuma yazma bilmeyen', 'Okuma yazma bilen fakat bir okul bitirmeyen',
    'İlkokul', 'İlköğretim', 'Ortaokul ve dengi meslek', 'Lise ve dengi meslek',
    'Yüksekokul veya fakülte', 'Yüksek lisans', 'Doktora', 'Bilinmeyen'
]
sub_headers = ['Toplam', 'Erkek', 'Kadın']

# Create multi-level headers
remaining_headers = [(main, sub) for main in main_headers for sub in sub_headers]

# Combine all headers
final_headers = first_three_headers + remaining_headers[:df.shape[1] - len(first_three_headers)]

df.columns = pd.MultiIndex.from_tuples(
    [(header, '') if isinstance(header, str) else header for header in final_headers]
)

Description:
- Single-level headers are assigned to the first three columns.
- Multi-level headers are created for remaining columns.
- Final column structure is established.


📊 Step 6: Data Structure Verification

# Display dataset information
print(df.info())

# Display the first few rows of the dataset
print(df.head())
Code Details
import pandas as pd

# Veri setini yükle ve tamamen boş sütunları kaldır
df = pd.read_excel('/home/mami/emu430/emu430-fall2024-team-education_atlas/data/yıllara_göre_egitim.xlsx')
df = df.dropna(axis=1, how='all')

# Yılları ve indeks aralıklarını tanımla
years = list(range(2008, 2024))  # 2008'den 2023'e kadar
index_ranges = [(1 + i * 82, 1 + (i + 1) * 82) for i in range(len(years))]

# Her yıl için alt veri çerçevelerini oluştur
yearly_data = {}
for year, (start, end) in zip(years, index_ranges):
    yearly_data[year] = df.iloc[start:end]

# İlk birkaç yılın veri çerçevelerini kontrol et
sample_years = {year: data.head(1) for year, data in list(yearly_data.items())[:3]}

# İlk iki satırı başlık olarak kullan ve eksik değerleri doldur
header1 = df.iloc[0].bfill()
header2 = df.iloc[1].bfill()

# Çok seviyeli (MultiIndex) başlıklar oluştur
df.columns = pd.MultiIndex.from_tuples(zip(header1, header2))

# İlk iki satırı kaldır ve veri setini sıfırla
df = df.iloc[2:].reset_index(drop=True)

# İlk 3 sütun için tek seviyeli başlıklar oluştur
first_three_headers = ['Yıl', 'İl Kodu', 'İl Adı']

# Ana başlıklar ve alt başlıkları tanımla
main_headers = [
    'Genel toplam', 'Okuma yazma bilmeyen', 'Okuma yazma bilen fakat bir okul bitirmeyen',
    'İlkokul', 'İlköğretim', 'Ortaokul ve dengi meslek', 'Lise ve dengi meslek',
    'Yüksekokul veya fakülte', 'Yüksek lisans', 'Doktora', 'Bilinmeyen'
]
sub_headers = ['Toplam', 'Erkek', 'Kadın']

# Çok seviyeli başlıkları oluştur
remaining_headers = [(main, sub) for main in main_headers for sub in sub_headers]

# Tüm başlıkları birleştir
final_headers = first_three_headers + remaining_headers[:df.shape[1] - len(first_three_headers)]

df.columns = pd.MultiIndex.from_tuples(
    [(header, '') if isinstance(header, str) else header for header in final_headers]
)

print(df.loc[df['Yıl'] == 2009])
      Yıl İl Kodu            İl Adı Genel toplam                      \
                                          Toplam     Erkek     Kadın   
81   2009     NaN           Türkiye     65049093  32612307  32436786   
82   2009       1             Adana      1838483    914799    923684   
83   2009       2          Adıyaman       514226    256220    258006   
84   2009       3    Afyonkarahisar       632466    312888    319578   
85   2009       4              Ağrı       453257    236508    216749   
..    ...     ...               ...          ...       ...       ...   
158  2009      77            Yalova       186120     93011     93109   
159  2009      78           Karabük       202539    101087    101452   
160  2009      79             Kilis       106205     52316     53889   
161  2009      80          Osmaniye       416123    208100    208023   
162  2009      81             Düzce       304102    151908    152194   

    Okuma yazma bilmeyen                   \
                  Toplam   Erkek    Kadın   
81               4672257  915054  3757203   
82                148634   29472   119162   
83                 64877   14374    50503   
84                 46634    7243    39391   
85                 65454   15413    50041   
..                   ...     ...      ...   
158                 6853    1197     5656   
159                16355    3022    13333   
160                11429    2476     8953   
161                36910    7244    29666   
162                18403    3211    15192   

    Okuma yazma bilen fakat bir okul bitirmeyen  ... Yüksekokul veya fakülte  \
                                         Toplam  ...                   Kadın   
81                                     13517214  ...                 1786379   
82                                       398108  ...                   48596   
83                                       135198  ...                    5811   
84                                       124173  ...                   11334   
85                                       188673  ...                    2408   
..                                          ...  ...                     ...   
158                                       30755  ...                    5630   
159                                       34413  ...                    5071   
160                                       27230  ...                    1478   
161                                       93989  ...                    7797   
162                                       59972  ...                    5868   

    Yüksek lisans                 Doktora               Bilinmeyen           \
           Toplam   Erkek   Kadın  Toplam  Erkek  Kadın     Toplam    Erkek   
81         279268  166285  112983   95500  61301  34199    3032457  1657575   
82           5570    3173    2397    2044   1312    732      80015    45750   
83            567     406     161     151    120     31      22234    12751   
84           1610    1018     592     588    447    141      15534     8827   
85            459     344     115     151    101     50      48889    26122   
..            ...     ...     ...     ...    ...    ...        ...      ...   
158           673     445     228     129     88     41      13696     7229   
159           456     309     147     172    133     39       5506     3070   
160           126      86      40      65     55     10       4504     2522   
161           591     396     195     106     68     38      17466     9477   
162           668     420     248     265    191     74      10764     5716   

              
       Kadın  
81   1374882  
82     34265  
83      9483  
84      6707  
85     22767  
..       ...  
158     6467  
159     2436  
160     1982  
161     7989  
162     5048  

[82 rows x 36 columns]

🚀 R’de .RData formatında kaydetmek için aşağıdaki komutları kullanabilirsiniz:

Code Details
import pandas as pd

# Veri setini yükle ve tamamen boş sütunları kaldır
df = pd.read_excel('/home/mami/emu430/emu430-fall2024-team-education_atlas/data/yıllara_göre_egitim.xlsx')
df = df.dropna(axis=1, how='all')

# Yılları ve indeks aralıklarını tanımla
years = list(range(2008, 2024))  # 2008'den 2023'e kadar
index_ranges = [(1 + i * 82, 1 + (i + 1) * 82) for i in range(len(years))]

# Her yıl için alt veri çerçevelerini oluştur
yearly_data = {}
for year, (start, end) in zip(years, index_ranges):
    yearly_data[year] = df.iloc[start:end]

# İlk birkaç yılın veri çerçevelerini kontrol et
sample_years = {year: data.head(1) for year, data in list(yearly_data.items())[:3]}

# İlk iki satırı başlık olarak kullan ve eksik değerleri doldur
header1 = df.iloc[0].bfill()
header2 = df.iloc[1].bfill()

# Çok seviyeli (MultiIndex) başlıklar oluştur
df.columns = pd.MultiIndex.from_tuples(zip(header1, header2))

# İlk iki satırı kaldır ve veri setini sıfırla
df = df.iloc[2:].reset_index(drop=True)

# İlk 3 sütun için tek seviyeli başlıklar oluştur
first_three_headers = ['Yıl', 'İl Kodu', 'İl Adı']

# Ana başlıklar ve alt başlıkları tanımla
main_headers = [
    'Genel toplam', 'Okuma yazma bilmeyen', 'Okuma yazma bilen fakat bir okul bitirmeyen',
    'İlkokul', 'İlköğretim', 'Ortaokul ve dengi meslek', 'Lise ve dengi meslek',
    'Yüksekokul veya fakülte', 'Yüksek lisans', 'Doktora', 'Bilinmeyen'
]
sub_headers = ['Toplam', 'Erkek', 'Kadın']

# Çok seviyeli başlıkları oluştur
remaining_headers = [(main, sub) for main in main_headers for sub in sub_headers]

# Tüm başlıkları birleştir
final_headers = first_three_headers + remaining_headers[:df.shape[1] - len(first_three_headers)]

df.columns = pd.MultiIndex.from_tuples(
    [(header, '') if isinstance(header, str) else header for header in final_headers]
)

# Veri setini bir dosyaya kaydet
import pyreadr

# Veri çerçevesini Rdata olarak kaydet
pyreadr.write_rds('data/education_data.RData', df)
Code Details
# Veri setini bir dosyaya kaydet
import pyreadr

# Veri çerçevesini Rdata olarak kaydet
pyreadr.write_rds('education_data.RData', df)

R’de kullanılmak üzere açıklama

✅ Veri seti başarıyla kaydedildi: “data/education_data.RData”
Rdata

Veri Seti Bağlantısı

Veri setine buradan ulaşabilirsiniz.

Description:
- The dataset’s column structure is inspected.
- The first few rows are displayed to validate adjustments.


✅ Summary and Results

After the preprocessing steps:

  1. The dataset was successfully loaded.
  2. Yearly data partitioning was completed.
  3. Multi-level column headers were structured.
  4. Sample data from selected years was displayed for verification.

🚀 Next Steps

  • Missing value analysis and filling strategies
  • Data normalization and scaling
  • Visualization and analysis steps

This report provides a detailed overview of the data preprocessing steps and prepares the dataset for further analysis. 🚀

Back to top